1. INTRODUCTION


A number of reinforcement learning systems have been proposed recently, such as the associative control process (ACP) network (Klopf, Morgan, and Weaver, 1993a, 1993b; Baird and Klopf, 1993a, 1993b), ADHDP (Werbos, 1989), Dyna (Sutton, 1990), other systems described in Barto and Bradtke (1991) based on Q-learning (Watkins, 1989; Watkins and Dayan, 1992), and systems based on advantage updating (Baird, 1993). These systems learn to be optimal controllers of nonlinear plants, typically requiring that a function f(x, u) be learned. They also require that the value of argmax_u f(x, u) be calculated repeatedly for various values of x, both during learning and when using the system as a controller. If the state x and action u are discretized, then the function can be represented as a finite lookup table. If the state x is a real-valued vector, then the function can be represented using standard function-representation techniques such as multilayer perceptrons, radial basis function networks, and memory-based learning and interpolation systems (Atkeson, 1990). However, if the action u is also a real-valued vector, then finding the maximum is extremely difficult with most function approximation systems. Although Tesauro's TD-Gammon program (Tesauro, 1990, 1992) demonstrates that some difficult problems can be solved using discrete values for u, most practical problems require real-valued vectors. The optimization algorithm described in Baird (1992) can approximate the maximum for optimal control problems (or the saddle point for differential games), but there may be errors in the maximization during learning. Systems using the stochastic real-valued unit (Gullapalli, 1990, 1991) or the Analog Learning Element (Millington, 1991) can learn real-valued actions without maximizing learned functions, but they require the use of a particular exploration scheme. It is desirable for a system to be able to learn under any exploration scheme that tries all actions in all states sufficiently often. Q-learning and advantage updating, for example, have this property. Also, it would be useful if the power of any general function approximation system could be harnessed to learn the function, while still allowing the maximum of the function to be found quickly and exactly. A method, wire fitting, is proposed that has these desirable properties.
2. MAXIMIZATION OF A FUNCTION


First consider the simpler problem of learning a function f(u) such that it is possible to quickly find the maximum of the function. Figure 1 shows one approach to solving this problem.

[Figure 1: plot of f(u) against u]

Figure 1. Method for storing a function f(u) such that the maximum can be found quickly.

The shape of the function is determined by three control points (circles). Six parameters u_1, u_2, u_3, y_1, y_2, y_3 are initialized to arbitrary values. As training samples are observed, the six parameters are adjusted so that f(u) (dotted line) is a good fit to the training data. The value of f(u) at point u is defined as a weighted average of the three y_i values, weighted by the distance between u and u_i, and also by the distance between y_i and y_max. This ensures that the maximum of f(u) always occurs at one of the control points, (u_i, y_i).

The shape of the function f(u) is controlled by six parameters which specify the location of three control points. The function f(u) is defined as:

$$f(u) = \frac{\sum_i y_i \left[\, |u - u_i| + \max_k y_k - y_i \,\right]^{-1}}{\sum_i \left[\, |u - u_i| + \max_k y_k - y_i \,\right]^{-1}} \tag{1}$$


The function is defined by a weighted-nearest-neighbor interpolation of the three control points. If equation (1) is undefined for a given value of u, then f(u) is defined to be max_i y_i for that value of u. The function may not go through every control point, but it is guaranteed to go through the highest point. Also, the function is guaranteed never to go above the highest point or below the lowest point. Therefore, the maximum of the function is guaranteed to be located at the u_i that has the same subscript as the maximum y_i value.
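The definition above can be sketched in a few lines of Python with NumPy. This is an illustrative reading of equation (1) for scalar u, not the authors' implementation; the names `f`, `argmax_f`, `u_ctrl`, and `y_ctrl` are hypothetical.

```python
import numpy as np

def f(u, u_ctrl, y_ctrl):
    """Weighted average of control-point heights, as in equation (1), scalar u."""
    u_ctrl = np.asarray(u_ctrl, dtype=float)
    y_ctrl = np.asarray(y_ctrl, dtype=float)
    y_max = y_ctrl.max()
    # Each weight is the reciprocal of: distance in u, plus shortfall below the top point.
    d = np.abs(u - u_ctrl) + (y_max - y_ctrl)
    if np.any(d == 0.0):
        # Equation (1) is undefined here; the text defines f(u) = max_i y_i.
        return float(y_max)
    w = 1.0 / d
    return float(np.sum(w * y_ctrl) / np.sum(w))

def argmax_f(u_ctrl, y_ctrl):
    """Read the maximizer off the control points without evaluating equation (1)."""
    i = int(np.argmax(y_ctrl))
    return u_ctrl[i], y_ctrl[i]
```

For example, with control points at u = 0, 1, 2 and heights 1, 3, -2, `argmax_f` returns the middle control point directly, and `f` evaluated anywhere stays between the lowest and highest heights.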

This function approximation system resembles a memory-based learning system, but is different. In a memory-based learning system, a set of training data is stored and interpolated to give the function f(u). In the system described here, the control points are initialized to arbitrary values. Then, as training data is observed, the control points shift until f(u) approximates the training data. For example, if all of the training data lies on the curve shown in Figure 1, then a gradient-descent learning algorithm will learn to place the three control points as shown in Figure 1. The control point (u_3, y_3), therefore, learns to be much lower than any of the training data. Equation (1) might not be a good algorithm for interpolating raw training data, but it may be useful for learning if the control points shift during learning. The maximum of the curve f can be found in even less time than it takes to evaluate f(u) for an arbitrary u, because the maximum can be found without using equation (1).

There may be uses for a system that can learn f(u) and find the maximum. It is more useful, however, to have a system that can learn f(x, u) and can find the u that maximizes the function for any given x. This can be done using the same method shown above, but with the parameters u_i and y_i replaced with functions u_i(x) and y_i(x). In this case, the control points become control wires in a higher-dimensional space, and the function is a surface fitted to those wires.
3. MAXIMIZATION OF A CROSS SECTION


Wire fitting is a function approximation method designed to facilitate finding the maximum of the function f(x, u) for any given x. When using wire fitting, the function f(x, u) is evaluated for a given x and u as shown in Figure 2.


[Figure 2: block diagram with inputs x and u]

Figure 2. The wire fitting architecture.

A function approximation system learns the function in the lower block. Given the state x, this generates a set of control points. The interpolating function then fits a function to the set of control points and calculates f(x, u), in the same manner as in Figure 1.

Any general function approximation system can be used to learn the function marked "learned function" in Figure 2. This function generates a set of control points based upon the value of x. A function is then fitted to the set of control points, and the value of f is then calculated from u in the manner illustrated in the previous section. Since there is a set of control points for every possible x, the control points are actually control curves, or wires, in a higher-dimensional space. Thus, the function is actually being fitted to a set of wires rather than to a set of points. The action u and the functions u_i are all vectors with the same number of elements. The state x is also a vector, possibly with a different number of elements. The function f and the functions y_i are all scalars, and f is a weighted average of the set of y_i. In a reinforcement learning system, the function f(x, u) typically represents the utility of performing action u in state x, so the u that maximizes f(x, u) is the optimal action to perform in state x. The lower box in Figure 2 can be any function approximation system, such as a multilayer perceptron trained by backpropagation. Its only input is the state x. Its output is a set of vector pairs (u_i, y_i), which control the shape of the function in state x. Equation (2) is a continuous, smooth function of its inputs, so it is possible to backpropagate errors in f back through equation (2) to update weights in the learning system:


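$$f(x, u) = \lim_{\epsilon \to 0^{+}} \frac{\displaystyle\sum_i \frac{y_i(x)}{\|u - u_i(x)\|^2 + c_i\,\bigl(y_{\max}(x) - y_i(x)\bigr) + \epsilon}}{\displaystyle\sum_i \frac{1}{\|u - u_i(x)\|^2 + c_i\,\bigl(y_{\max}(x) - y_i(x)\bigr) + \epsilon}} \tag{2}$$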


For a given state, the set of vector pairs (u_i, y_i) is interpolated to give f(x, u). The value y_max(x) is simply the maximum of the y_i values. Equation (2) defines f(x, u) for a particular u to be a weighted average of y_i values. If u is near a particular u_i, then the corresponding y_i is given more weight. The nonnegative constant parameters c_i determine the amount of smoothing. If all c_i = 0, then the interpolation "honors the data", and f = y_i when u = u_i. If the c_i values are positive, the interpolated function is smoother, and f may not be exactly equal to y_i even when u = u_i. The constants c_i can be chosen a priori, or they can be learned. As will be shown below, if the learned function is trained with a memory-based learning method, then the values for c_i can be chosen arbitrarily, with no effect on learning or performance. The limit in equation (2) is merely for mathematical completeness. It ensures that the function is defined when u = u_i. The equation can be written without the limit and ε, if it is stated that f(x, u) = y_i whenever the coefficient of y_i in the summation would be undefined.
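A minimal NumPy sketch of the interpolation for a single state follows. It assumes the learned function has already produced the control points for that state, and it reads each denominator in equation (2) as the squared Euclidean distance between u and u_i plus c_i times the shortfall of y_i below y_max. The names `wire_fit`, `U`, `y`, and `c` are hypothetical.

```python
import numpy as np

def wire_fit(u, U, y, c, eps=0.0):
    """Interpolate f(x, u) from the control points of one state.

    u : (m,) action vector
    U : (n, m) control-point actions u_i(x)
    y : (n,) control-point heights y_i(x)
    c : (n,) nonnegative smoothing constants c_i
    """
    u, U, y, c = (np.asarray(a, dtype=float) for a in (u, U, y, c))
    y_max = y.max()
    d = np.sum((u - U) ** 2, axis=1) + c * (y_max - y) + eps
    zero = d == 0.0
    if zero.any():
        # Limit case: the terms with a vanishing denominator dominate,
        # so f equals the corresponding y_i (their mean if several coincide).
        return float(y[zero].mean())
    w = 1.0 / d
    return float(w @ y / w.sum())
```

With all c_i = 0 the function returns y_i exactly at u = u_i. With positive c_i it still returns y_max at u_max, but may differ from y_i at the other control points.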

The control points (u_i, y_i) serve to shape the function in a given state. Each control point plays a role analogous to a knot for a spline or a data point for an interpolation function. It is also analogous to the parameters associated with one radial basis function in a radial basis function network. In each case, the parameters have a local effect on the shape of the function. However, Equation 2 has one property that distinguishes it from other interpolation algorithms. No matter what values the vector pairs have, it is always the case that:

$$\max_u f(x, u) = \max_i y_i(x) \equiv y_{\max}(x) \tag{3}$$


This is easily proved. First, consider a value of u not equal to any u_i. In this case, the expression in Equation 2 is defined for ε = 0. f is then a weighted average of the y_i, with each weight between zero and one and the sum of the weights equal to one. A weighted average of several numbers cannot exceed the largest number, so f is less than or equal to the maximum y_i, which is y_max. Next, consider the case in which u is equal to u_max, where u_max is the u_i with the same subscript as y_max. In this case, as ε goes to zero, the sums in the numerator and denominator come to be dominated by the term containing u_max and y_max, so in the limit f = y_max. Lastly, consider the case in which u = u_i ≠ u_max. By a similar argument, if c_i = 0, then f = y_i ≤ y_max. If c_i > 0, then f is simply a weighted average of the y_i, so f ≤ y_max. Thus, when u = u_max, f = y_max, and for every u ≠ u_max, f ≤ y_max. Therefore, Equation 3 is true.
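The bound used in the first case of the proof can be written out in one line. As a sketch, let d_i denote the denominator attached to y_i at a point where every denominator is positive, and let λ_i denote the resulting normalized weight. Both symbols are introduced here for illustration only and do not appear in the original.

```latex
d_i = \|u - u_i(x)\|^2 + c_i\,\bigl(y_{\max}(x) - y_i(x)\bigr) > 0,
\qquad
\lambda_i = \frac{d_i^{-1}}{\sum_k d_k^{-1}},
\qquad
\lambda_i \ge 0, \quad \sum_i \lambda_i = 1
\\[6pt]
f(x, u) = \sum_i \lambda_i\, y_i(x) \;\le\; \sum_i \lambda_i\, y_{\max}(x) = y_{\max}(x)
```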

Given this method for representing a function f, it is possible to implement a reinforcement learning system that learns from any sequence of actions. Any function approximation system can be used as the lower box in Figure 2. The system in Figure 2 can be used to quickly calculate the f value for a given state-action pair, f(x, u), or the optimal action in a state, u_max(x), or the maximum f value for a state, y_max(x). If action u is performed in state x, then f(x, u) can be calculated immediately. On the next time step (or several time steps later for multistep learning), an improved estimate can be calculated for f(x, u) by the reinforcement learning algorithm, using the value of the new states and the reinforcement received. This can be used to calculate an error in f(x, u). If the learning system is gradient-based, then the error can be propagated back through Equation 2 and through the learned function, so that f(x, u) moves toward the improved estimate. Thus, this method for representing f is flexible, and can be incorporated in a variety of reinforcement learning systems.
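The update described above can be sketched for a one-step Q-learning target. This is only an illustration under several assumptions: the control points are adjusted directly rather than through a learned function, central finite differences stand in for backpropagation, a small fixed `eps` replaces the limit, and the discount factor `gamma` and step size `lr` are hypothetical values. All function names are hypothetical.

```python
import numpy as np

def wire_fit(u, U, y, c, eps=1e-6):
    """Interpolated value for one state, with a small eps instead of the limit."""
    d = np.sum((u - U) ** 2, axis=1) + c * (y.max() - y) + eps
    w = 1.0 / d
    return float(w @ y / w.sum())

def numeric_grad(u, U, y, c, h=1e-6):
    """Central-difference gradient of f with respect to every u_i and y_i."""
    gU, gy = np.zeros_like(U), np.zeros_like(y)
    for idx in np.ndindex(U.shape):
        Up, Um = U.copy(), U.copy()
        Up[idx] += h
        Um[idx] -= h
        gU[idx] = (wire_fit(u, Up, y, c) - wire_fit(u, Um, y, c)) / (2 * h)
    for i in range(len(y)):
        yp, ym = y.copy(), y.copy()
        yp[i] += h
        ym[i] -= h
        gy[i] = (wire_fit(u, U, yp, c) - wire_fit(u, U, ym, c)) / (2 * h)
    return gU, gy

def q_learning_step(u, U, y, r, y_next, c, gamma=0.9, lr=0.1):
    """Move f(x, u) toward r + gamma * y_max(x_next) by one gradient step."""
    # The maximum over actions in the next state is just the largest y_i there.
    target = r + gamma * np.max(y_next)
    error = target - wire_fit(u, U, y, c)
    gU, gy = numeric_grad(u, U, y, c)
    return U + lr * error * gU, y + lr * error * gy, error
```

The point of the sketch is that the maximization inside the target costs only a comparison of the y_i values for the next state; no search over actions is needed.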

This method for representing the function f(x, u) can be illustrated graphically, as shown in Figure 3.

[Figure 3: surface plots of f(x, u) over x and u, with a cross section at state x_0]

Figure 3. An example of a function f(x, u) whose shape is determined by three wires.

In any given state, such as x_0, the wires intersect the plane of that state at three points. These three points are the control points that determine the shape of the function for that value of x. The shape of the function in that plane is determined by the location of the three wires, and the function is guaranteed to pass through the point (u_max(x_0), y_max(x_0)), which in this example is the point (u_2(x_0), y_2(x_0)).

The upper graphs in Figure 3 show an example of a function f(x, u), where x and u are scalars. The graph on the left is the function f itself, while the one on the right shows three control wires superimposed on the picture. The lower graph is a cross section of the function, taken at state x_0. The set of all points of the form (x, u_i(x), y_i(x)) forms the ith wire in 3-D space. The shape of the surface is then determined by the shape and location of the control wires. The shape of the function in this example is determined by three wires: a high, curved wire (dark gray), a medium, curved wire (light gray), and a low, straight wire (black). Although the surface does not touch the wires at every point, it is drawn toward them, and so consists of two intersecting ridges with a valley between them. Where the ridges intersect, the surface rises to the highest wire. In this example each wire has a constant height but, in general, a wire could have a varying height. The lower picture in Figure 3 shows a cross section of the graph on the right for a particular state, x_0. Each wire intersects the plane of x_0 at a point, so the three wires define three control points. The learning system learns the location of each control point in each state. The surface is defined by Equation 2, which ensures that the highest point on the surface will lie on one of the control points within the cross section at any given state.
4. MEMORY-BASED LEARNING


The method presented here for representing a function can be used with a variety of function representation systems. It is clear how it could be used with a gradient-based function approximation system. The error in f can be propagated back through equation (2) (which is differentiable), to change the weights in the learning system. This causes the control wires to shift until the surface has the appropriate shape to minimize the mean squared error in f. It may be less clear how it could be used with a memory-based function approximation system, so we elaborate upon that alternative in this section.

For a memory-based function approximation system, the stored information will comprise a set of triplets (x_t, u_t, E_t). If action u_t is performed in state x_t at time t, the system will output f(x_t, u_t). The reinforcement learning algorithm then calculates an estimate E_t of what f(x_t, u_t) should have been, based on the results of performing action u_t in state x_t. Once this estimate has been calculated, the triplet (x_t, u_t, E_t) can be stored. The functions u_i(x) and y_i(x) can be calculated from the set of stored memories. If old memories are eventually lost, perhaps because of a finite-sized memory set, then the u_i(x) and y_i(x) functions would be expected to improve with experience, yielding memory-based learning.
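The storage scheme can be sketched as a bounded buffer of triplets. This is a hypothetical illustration: the class name `TripletMemory`, the `capacity` parameter, and the first-in-first-out eviction policy are assumptions, since the text says only that old memories may eventually be lost.

```python
from collections import deque
import numpy as np

class TripletMemory:
    """Finite store of (x_t, u_t, E_t) triplets; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)

    def store(self, x, u, E):
        """Record the state, the action taken, and the estimate E_t of f(x_t, u_t)."""
        self._data.append((np.asarray(x, dtype=float),
                           np.asarray(u, dtype=float),
                           float(E)))

    def nearest(self, x, n):
        """Return the n stored triplets whose states are closest to x (Euclidean)."""
        x = np.asarray(x, dtype=float)
        return sorted(self._data, key=lambda t: np.linalg.norm(t[0] - x))[:n]

    def __len__(self):
        return len(self._data)
```

Note that `store` never evaluates f(x, u): only the input pair and the desired output are kept.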

Memory-based learning has an advantage relative to gradient learning systems when used with wire fitting. It is possible to calculate and store each triplet without calculating f(x, u). In a gradient learning system, the output of the system must be calculated so that an error can be found to drive learning. In a memory-based system, examples of inputs and desired outputs are simply stored, and the actual outputs f(x, u) need not be calculated. Thus, for the particular case of a memory-based learning system, Equation 2 need never be evaluated. This not only saves calculation time, but also simplifies the system because the constants c_i do not have to be chosen or learned.

An important question for a memory-based system is that of how the functions u_i(x) and y_i(x) can be calculated from the set of stored data. In Figure 3, this would correspond to the question of how several wires can be created that will generate a surface that is a reasonable approximation to a set of data points scattered throughout the cube. If there are n functions u_i(x) and y_i(x), then every state will intersect n of the wires. One possible solution is presented next.

For a given state x, the functions u_i(x) and y_i(x) are defined by Equations 4 through 10. If there are n wires, then there will be a wire associated with each of the n data points nearest to state x (Euclidean distance). The ith wire will not necessarily go through the ith data point, but u_i(x) will typically be fairly close to the u component of the associated data point. In the equations that follow, j is an index that ranges over all stored data points. The index i ranges over those data points that are associated with wires. States and actions are vectors. The subscript k represents the kth element of an action vector, and the subscript L represents the Lth element of a state vector.


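[Equations (4) through (10) are not legible in the source. The surviving fragments show that Equation (4) sums squared differences over the state elements L and over the action elements k, that Equations (5) through (8) contain sums over the stored data points j, and that Equations (9) and (10) give u_ik(x) and y_i(x) as a base value plus a correction term.]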


Each of the n data points (x_i, u_i, E_i) is projected into the plane of the current state x, to give a projected point (x, u_i, E_i). These points are locations where estimates of the value of f should be most reliable. All of the data points (not just the n closest) have an effect on the wire associated with each projected point. The effect of the jth data point on the ith wire is inversely proportional to its distance from the projection, and is given by Equation 4. Equations 5 through 8 perform weighted linear regression. This gives an estimate of the direction one should move from the projected point to maximize the function f(x, u). Equations 9 and 10 place the location of the ith wire (u_i(x), y_i(x)) near the projected point, slightly uphill in the direction found by weighted linear regression. Thus, each wire comprises a local estimate of an action that would maximize f, and an estimate of the f value for that action. The linear regression is done separately for each dimension. For high-dimensional action vectors, this is less computationally intensive than doing multidimensional linear regression. The results are the same when the stored values (x_t, u_t) are evenly distributed (have zero covariance). If the state-action space is explored unevenly, then the stored values may not be evenly distributed, and it may be necessary to perform an affine transformation on the data to give zero covariance.
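The procedure described in this paragraph can be sketched as follows. This is not a transcription of Equations 4 through 10; it is one plausible reading of the prose, under stated assumptions: weights are the reciprocal of the Euclidean distance from each stored point to the projected point, the regression slope is computed separately for each action dimension, and the wire is placed at the projected point plus a step of size `alpha` along the slope. The names `place_wire`, `alpha`, and `eps` are hypothetical.

```python
import numpy as np

def place_wire(x, i, X, U, E, alpha=0.1, eps=1e-9):
    """Hypothetical placement of the wire tied to stored point i, at query state x.

    X : (N, p) stored states, U : (N, m) stored actions, E : (N,) stored estimates.
    Returns (u_wire, y_wire) for the cross section at x.
    """
    x = np.asarray(x, dtype=float)
    X, U, E = (np.asarray(a, dtype=float) for a in (X, U, E))
    # Distance of every stored point j from the projected point (x, u_i).
    dist = np.sqrt(np.sum((X - x) ** 2, axis=1) + np.sum((U - U[i]) ** 2, axis=1))
    w = 1.0 / (dist + eps)          # assumed inverse-distance weighting
    w /= w.sum()
    u_bar = w @ U                   # weighted mean action, shape (m,)
    E_bar = w @ E                   # weighted mean estimate
    du = U - u_bar
    dE = E - E_bar
    var = w @ du ** 2               # weighted variance per action dimension
    cov = (w * dE) @ du             # weighted covariance with E per dimension
    # One-dimensional weighted regression slope for each action dimension.
    slope = np.where(var > 0, cov / np.where(var > 0, var, 1.0), 0.0)
    u_wire = U[i] + alpha * slope                 # step slightly uphill
    y_wire = E[i] + alpha * np.sum(slope ** 2)    # linear estimate of f at that step
    return u_wire, y_wire
```

Because the slopes are computed one dimension at a time, the cost grows linearly with the number of action dimensions, which is the saving the text attributes to per-dimension regression.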
5. SIMULATION RESULTS


The wire fitting approach was tested by incorporating it into a reinforcement learning system used to control an inverted pendulum hinged to a cart moving on an infinite track. Q-learning was used for the reinforcement learning algorithm, and a memory-based learning system was used as the function approximation system. The equations for the cart-pole system are:

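$$(m_c + m_p)\,\ddot{x} + m_p l\,\ddot{\theta}\cos\theta - m_p l\,\dot{\theta}^2 \sin\theta = f - \mu_c\,\mathrm{sgn}(\dot{x}) \tag{11}$$

$$\tfrac{4}{3}\, m_p l^2\,\ddot{\theta} + m_p l\,\ddot{x}\cos\theta - m_p g l \sin\theta = -\mu_p\,\dot{\theta} \tag{12}$$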


where:

    x     =  position of the cart (m)
    θ     =  pole angle (rad)
    g     =  9.8 m/s^2         acceleration due to gravity
    m_c   =  1.0 kg            mass of the cart
    m_p   =  0.1 kg            mass of the pole
    l     =  0.5 m             pole half-length
    μ_c   =  0.0005 N          friction between cart and track
    μ_p   =  0.000002 N m s    friction between pole and cart
    |f|   ≤  10.0 N            force applied to cart

The cart-pole system was simulated by Euler integration at 50 Hz. Reinforcement was proportional to the pole angle squared, with an additional negative reinforcement when the pole exceeded 12 degrees from vertical. The learning system was allowed to learn for only 60 seconds of simulated time, during which a random action in the range [-10, 10] newtons was chosen with uniform probability on each time step. This training data contained information on only a small portion of the state space, so the learning system was forced to generalize. After the 60 seconds of training, learning was disabled, and the system was then able to balance the pole indefinitely. When the learning system was applied to a finite-track, cart-pole problem, it was not able to learn to control the cart and pole consistently. This appears to be due to the fact that a time step was only 0.02 second. Baird (1993) explains why Q-learning cannot learn in continuous time (or discrete time with small time steps), and proposes a new algorithm, advantage updating, which does not have this limitation. Advantage updating could be combined with wire fitting and a function approximation system; this remains an area for future research.
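The simulation setup can be sketched as below, using the parameter values listed above and a 0.02 s forward-Euler step. The dynamics are an assumption: they follow the commonly used cart-pole model with Coulomb friction on the cart and viscous friction at the hinge, solved as a 2-by-2 linear system for the two accelerations. Reinforcement and resets after the pole falls are not modelled, and all function names are hypothetical.

```python
import numpy as np

G, M_C, M_P, L = 9.8, 1.0, 0.1, 0.5      # gravity, cart mass, pole mass, half-length
MU_C, MU_P = 0.0005, 0.000002            # cart-track and pole-cart friction
DT = 0.02                                # 50 Hz Euler step
F_MAX = 10.0                             # force limit in newtons

def accelerations(state, force):
    """Solve the coupled equations of motion for cart and pole accelerations."""
    _, x_dot, th, th_dot = state
    s, c = np.sin(th), np.cos(th)
    A = np.array([[M_C + M_P, M_P * L * c],
                  [M_P * L * c, (4.0 / 3.0) * M_P * L ** 2]])
    b = np.array([force - MU_C * np.sign(x_dot) + M_P * L * th_dot ** 2 * s,
                  M_P * G * L * s - MU_P * th_dot])
    x_acc, th_acc = np.linalg.solve(A, b)
    return x_acc, th_acc

def euler_step(state, force):
    """Advance (x, x_dot, theta, theta_dot) by one forward-Euler step."""
    force = float(np.clip(force, -F_MAX, F_MAX))
    x, x_dot, th, th_dot = state
    x_acc, th_acc = accelerations(state, force)
    return np.array([x + DT * x_dot, x_dot + DT * x_acc,
                     th + DT * th_dot, th_dot + DT * th_acc])

def random_rollout(seconds=60.0, seed=0):
    """Collect (state, force, next_state) transitions under uniform random forces."""
    rng = np.random.default_rng(seed)
    state = np.zeros(4)
    history = []
    for _ in range(int(round(seconds / DT))):
        force = rng.uniform(-F_MAX, F_MAX)
        nxt = euler_step(state, force)
        history.append((state, force, nxt))
        state = nxt
    return history
```

Calling `random_rollout()` with the default 60 seconds yields 3000 transitions, which matches the amount of random exploration described in the text.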


6. CONCLUSION


We have proposed wire fitting, a new method for representing functions using any general function approximation system. This method solves the maximization problem arising in reinforcement learning systems and offers several other advantages. We have presented an example of a memory-based system that may be used with the method to represent Q functions, and have shown how the method, combined with the memory-based system, can be used for reinforcement learning on a cart-pole control problem.